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Abstract 

We have developed an algorithm for non-parametric fitting and extraction 
of statistically significant peaks in the presence of statistical and systematic 
uncertainties. Applications of this algorithm for analysis of high-energy colli- 
sion data are discussed. In particular, we illustrate how to use this algorithm 
in general searches for new physics in invariant-mass spectra using pp Monte 
Carlo simulations. 
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1 Introduction 



Searches for peaks in particle spectra is a task which is becoming increasingly pop- 
ular at the Large-Hadron collider that focuses on new physics beyond TeV-scale. 
Bump searches can be performed either in single-particle (such as pt distributions) 
or invariant-mass spectra. For instance, searches for new particles decaying into 
a two-body final state (jet-jet, gamma-gamma, etc.) and multi-body decays are 
typically done by examining invariant masses of final-state objects (jets, leptons, 
missing transverse momenta, etc.). For example, assuming seven identified particles 
(jets, photons, electrons, muons, taus, Z-bosons, missing pt), a search can be made 
for parent particles decaying into 2, 3, or 4 daughter particles. This leads to 322 
unique daughter groups. Thus, the task of analyzing such invariant- mass combina- 
tions becomes rather tedious and difficult to handle. Considering a "blind" analysis 
techniques for scanning many channels [1] , any cut variation increases the number of 
channels that need to be investigated. Finally, similar challenges exist for automatic 
searches for new hadronic resonances combining tracks [2\. 

The task of finding bumps is ultimately related to the task of determining a cor- 
rect background shape using theoretical or known cross sections. However, a theory 
can be rather uncertain in the regions of interest, difficult to use for background 
simulation or entirely nonexistent. Even for a simple jet-jet invariant mass, finding 
an analytical background function that fits the QCD-driven background spanning 
many orders in magnitude and which can be used to extract possible excess of events 
due to new physics requires a careful examination. Attempts to fit two-jet and three- 
jet invariant masses have been discussed in CMS ^3^4] and ATLAS [5J papers; while 
both experiments have reached the necessary precision for such fits using initial low- 
statistics data, the used analytical functions are rather different and have many free 
parameters. This task becomes even more difficult considering multiple channels 
(invariant- mass distributions) with various cuts or detector-selection criteria (like 
6-tagging). Each such channel requires a careful selection of analytical functions for 
background fit and adjustments of their initial values for convergence of a non-linear 
regression while determining an expectantly smooth background shape. A fully au- 
tomated approach to searches for new physics has been discussed elsewhere 

One technically attractive approach is to find a non-parametric way to extract 
statistically significant peaks without a prior assumptions on background shapes. 
Such approach is popular in many areas, from image processing to studies of finan- 
cial market, where a typical peak-identification task is reduced to data smoothing 
in order to create a function that attempts to approximate the background. The 
smoothing can be achieved using the moving average [7], Lowess [8J, SPlines [H] 
algorithms. Statistically significant deviations from smoothed distributions can be 
considered as peaks. Such technique is certainly adequate for the peak extraction, 
but it does not pursue the goal of peak identification with a correct treatment of 
statistical (or systematic) uncertainties. 

The closest peak-search approach to high-energy case has been developed for 
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studies of 7-ray spectra where the usual features of interest are the energies and 
intensities of photo-peaks. Several techniques have been developed, such as those 
based on least squares [10], second differences with least-squares fitting [11], Fourier 
transformation [12], Markov chain [13], convolution [l3] (just naming a few). While 
such approaches are well suited for counting-type observables, they typically focus on 
narrow peaks on top of small and often flat-shaped background. In high-energy colli- 
sions, a typical standard-model background distribution has a falling shape spanning 
many orders of magnitude in event counts. A typical example is jet-jet invariant 
masses used for new particle searches [3l[5]. For such spectra, the most interesting 
regions are the tails of the exponentially suppressed distributions where a new high- 
Pt physics may show up. This means that there should be rather different thresholds 
to statistical noise, depending on the phase-space region, and as the result, a correct 
treatment of statistical and systematic uncertainties is obligatory. Unlike the 7-ray 
spectra where peaks are rather common and subject of various classification tech- 
niques, peaks in high-energy collisions are rather rare. As a consequence, relatively 
little progress has been made to develop a non-parametric fitting technique for high- 
energy physics apphcations where an observation of peaks is typically a subject for 
searches for new physics rather than for peak-classification purposes. 

The above discussion leads to the need for a non-parametric way of background 
estimation together with the peak extraction mechanism which can be suited for 
high-energy collision distributions, such as invariant masses. The algorithm should 
be able to take into account the discrete nature of input distributions with their 
statistical (and possibly systematic) uncertainties. 

2 Non-parametric peak finder algorithm 

Due to the reasons discussed above, the program called Non-Parametric Peak Finder 
(NPFinder) was developed using a numerical, iterative approach to detect statisti- 
cally significant peaks in event-counting distributions. In short, NPFinder iterates 
through bins of input histograms and, using only one sensitivity parameter, deter- 
mines the location and statistical significance of possible peaks. Unlike the known 
smoothing algorithms, the main focus of this method is not how to smooth data 
and then extract peaks, but rather how to extract peaks by comparing neighboring 
points and then calling what is left over the "background". Below we discuss the 
major elements of this algorithm and then we illustrate and discuss its limitations 
and possible improvements. 

For each point i in a histogram, the first-order derivative is found taking 
into account possible (statistical or/and systematic) uncertainties. This is done by 
calculating the slope between two points including their experimental uncertainties: 
If point z + 1 is lower than point i, the upper error is used, while if point i + 1 is 
higher than point i, the lower error uncertainty is used. This is done in order to be 
always on a conservative side while reducing statistical noise. Mathematically, this 
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can be written as: 



Vi+i + 6yi+i - yi 
0!i = , (1) 

where the uncertainty Syi+i is taken with negative sign for > yi, and with 
positive sign otherwise. The uncertainly may not need to be symmetric; but for 
simphcity we assume that they are symmetric as this is usually the case for statistical 
nature of uncertainties. The derivatives are averaged calculating a running average 
for any given position A^: 



The algorithm triggers the beginning of a peak if the local derivatives satisfy: 

SaN+l = ftAT+i - Oat > A, , . 

5aN+2 = aAr+2 — Oat > A, 

where A is a free positive parameter. When the above conditions are true, NPFinder 
registers a possible peak and begins classifying next points as a part of the peak. 
The running average Eq. [2] is not accumulated for the points which belong to a 
possible peak. A is the only free parameter which specifies the sensitivity to the 
peak finding. Typically, it should be close to 1 and decreasing it to a smaller value 
will increase sensitivity to the peaks (and likely will increase sensitivity to statistical 
fluctuations). 

NPFinder continues to walk over data points until daj\f+i and 5aj\f+2 are both 
negative, which signifies the maximum of the peak has been reached. The double 
condition in Eq. [3] is used to reinforce the peak-search robustness. When this con- 
dition is met, NPFinder exits the peak and adds an equal number of points to the 
right side from the peak center. The requirement of having the same number of 
points implies that the peaks are expected to be symmetric (which is typically the 
case). For steeply falling distributions, as in the most common case for high-energy 
physics, this assumption usually means that we somewhat underestimate the peak 
significance. 

After detecting all peak candidates, NPFinder iterates through the list of possible 
peaks in order to form a background for each peak. This is achieved by performing a 
linear regression of points between the first and last points in the peak, i.e. applying 
the function yi = mxi + b, where m and b are the slop and intercept of the linear 
regression, which in this case is rather trivial as it is performed via the two points 
only. It should be noted that the linear regression is also performed taking into 
account uncertainties: 

(2/2 + Sy2) - (2/1 + 6yi) 
m = , 

X2 — Xi 

where yi is the first point of the peak, y2 is the final point of the peak, and 6yi and Sy2 
are their statistical uncertainties, respectively. Here the statistical uncertainties are 
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added in order to always be on the conservative side in estimation of the background 
level under the peak. The intercept parameter then is b = yi + 6yi — mxi. 

It should be mentioned that the technique of peak finding considered above is 
somewhat similar to that discussed for 7-ray applications |[1]J. But there are several 
important differences: the peaks can have arbitrary shapes (not only a Gaussian 
form), no fitting or smoothing procedure is used and errors on data points are 
included at the iteration step. 

Finally, NPFinder uses the background points to calculate the statistical signifi- 
cance of each peak in a given histogram. This is done by summing up the differences 
Ti of the original points in a peak with respect to the calculated background points, 
and then dividing this value by it's own square root. For a given peak, it can be 
approximated by: 



where the sum runs over all points in the peaks. The algorithm runs over an input 
histogram or graph, builds a list of peaks and estimates their statistical significance. 
A typical statistically significant peak in this approach has a > 5 — 7. A first peak is 
usually ignored as it corresponds to the kinematic peak of background distributions. 

Below we illustrate the above approach by generating fully inclusive pp collision 
events using the PYTHIA generator |15j. The required integrated luminosity was 
200 ph~\ Jets are reconstructed with the anti-Air algorithm with a distance 
parameter of 0.6 using the cut pt > 100 GeV. Currently, this jet algorithm is 
the default at the ATLAS and CMS experiments. Then, the dijet invariant-mass 
distribution is calculated and the NPFinder finder is applied using the parameter 
A = 1. As expected, no peaks with a > 5 were found. 

Finally, a few fake peaks were generated using Gaussian distributions with dif- 
ferent peak positions and widths. The peaks were added to the original background 
histogram. Figure [1] shows an example with 3 peaks generated at 1000 GeV (20 GeV 
width, 200000 events), 1500 (50 GeV widths, 30000 events) and at 2800 GeV (40 
GeV widths, 1200 events). The algorithm found all three peaks and gave correct 
estimates of their positions, widths and approximate statistical significance using 
the input parameter A = 1. 

It should be noted that the peak statistical significance of the proposed non- 
parametric method might be smaller than that calculated using more conventional 
approaches, such as those based on a minimisation with appropriate background 
and signal functions. This can be due to the assumption on the symmetric form of 
the extracted peaks, the linear approximation for the background under the peaks, 
and the way in which the uncertainty is incorporated in the peak-significance calcu- 
lation. An influence of experimental resolution can also be an issue [17] which can 
only be addressed via correctly identified signal and background functions. Such 
drawbacks are especially well seen for the highest-mass peak shown in Figure [T] 
where a statistical fluctuation to the right of the peak pulls the background level 
up compared to the expected falling shape. Given the approximate nature of the 
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statistical significance calculations which only serve to trigger attention of analyzers 
who need to study the found peaks in more detail, the performance of the algorithm 
was found to be reasonable. 

It should be noted that there is a correlation between the peak width and the 
input parameter A: a detection of broader peaks typically requires a smaller value 
of A (which can be as low as 0.2). 

In conclusion, a peak-detection algorithm has been developed which can be used 
for extraction of statistically significant peaks in event-counting distributions tak- 
ing into account statistical (and potentially systematic) uncertainties. The method 
can be used for new physics searches in high-energy particle experiments where a 
correct treatment of such uncertainties is one of the most important issues. The 
non-parametric peak finder has only one free parameter which is fairly independent 
of input background distributions. The algorithm was tested and found to per- 
form well. The code is implemented in the Python programming language with the 
graphical output using either ROOT (C-I-+) [18] or jHepWork (Java) ^19j. The code 
example is available for download [20] . 

Acknowledgements 

We would like to thank J. Proudfoot for discussion and comments. The submitted 
manuscript has been created by UChicago Argonne, LLC, Operator of Argonne 
National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of 
Science laboratory, is operated under Contract No. DE-AC02-06CH11357. 



5 



(0 

c 
□J 



10° 



10' 



10° 



10-= 



10 





T 


Peak 


\ 


H 


- Background 








.-V 


~ 





, , , ,t, IT, ,",11,1 



500 1000 1500 2000 2500 3000 3500 4000 

M(jet-jet) [GeV] 



Figure 1: Invariant mass of two jets generated with the PYTHIA Monte Carlo 
model. Several peaks seen in this figure were added using Gaussian distributions 
with different widths and peak values (see the text). The peaks are found using the 
NPFinder algorithms which also estimates their statistical significances as discussed 
in the text. 



6 



References 

[1] G. Choudalakis (2008). Ph.D. MIT Thesis, 2008, arXiv:0805.3954. 

[2] S. Chckanov, A C++ framework for automatic search and identification of 
resonances, in Proceedings of the HERA and the LHC: A Workshop on the im- 
plications of HERA for LHC physics. Part B, p. 624. CERN-2005-014, DESY- 
PROC-2005-01, 2005. 

[3] CMS Collaboration, V. Khachatryan, et al., Phys. Rev. Lett. 105 (2010) 211801. 

[4] CMS Collaboration, S. Chatrchyan, et al. (2011). ArXiv:1107.3084. 

[5] ATLAS Collaboration, G. Aad, et al. New J. Phys. 13 (2011) 053044. 

[6] CDF Collaboration, T. Aaltonen, et al., Phys. Rev. D79 (2009) 011101. 

[7] E. S. Kenney, J. F.and Keeping, Mathematics of Statistics, Pt. 1, 3rd ed. 1962. 

[8] W. S. Cleveland, Journal of the American Statistical Association 74 (1979) 829. 

[9] J. Schoenberg, Quart. Appl. Math. 4 (1946) 45 and 112. 

[10] F. Huang, C. Osman, T. R. Ophel, Nucl. Instr. and Methods 68 (1969) 141 . 

[11] M. Mariscotti, Nucl. Instr. and Methods 50 (1967) 309 . 

[12] K. J. Bhnowska, E. F. Wessner, Nucl. Instr. and Methods 118 (1974) 597 . 

[13] Z. K. Silagadze, Nucl. lustrum, and Methods A376 (1996) 451. 

[14] M. Morhac, J. Kliman, V. Matousek, M. Veselsky, I. Turzo, Nucl. Instr. and 
Methods in Physics Research A 443 (2000) 108. 

[15] T. Sjostrand, S. Mrenna, P. Z. Skands, JHEP 05 (2006) 026. 

[16] M. Cacciari, G. P. Salam, G. Soyez, JHEP 04 (2008) 063. 

[17] S. V. Chekanov, B. B. Levchenko, Phys. Rev. D 76 (2007) 074025. 

[18] I. Antcheva, et al., Comput. Phys. Commun. 180 (2009) 2499. 

[19] S. Chekanov, Scientific Data Analysis Using Jython Scripting and 
Java. Springer- Verlag, London, UK, 2010. The program available in 
http://jwork.org/jhepwork/. ISBN 978-1-84996-286-5. 

[20] M. Erickson, S. Chekanov. https:/ /atlaswww.hep. anl.gov/asc/packages/NPFinder/. 



7 



